{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Geometry Aware Inductive Matrix Completion (GeoIMC)\n",
    "\n",
    "GeoIMC is an inductive matrix completion algorithm based on the works by Jawanpuria et al. (2019)\n",
    "\n",
    "Consider the case of MovieLens-100K (ML100K), Let $X \\in R^{m \\times d_1}, Z \\in R^{n \\times d_2} $ be the features of users and movies respectively. Let $M \\in R^{m \\times n}$, be the partially observed ratings matrix. GeoIMC models this matrix as $M = XUBV^TZ^T$, where $U \\in R^{d_1 \\times k}, V \\in R^{d_2 \\times k}, B \\in R^{k \\times k}$ are Orthogonal, Orthogonal, Symmetric Positive-Definite matrices respectively. This Optimization problem is solved by using Pymanopt.\n",
    "\n",
    "\n",
    "This notebook provides an example of how to utilize and evaluate GeoIMC implementation in **reco_utils**\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "import tempfile\n",
    "import zipfile\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import papermill as pm\n",
    "import scrapbook as sb\n",
    "sys.path.append(\"../../\")\n",
    "sys.path.append(\"../../reco_utils/recommender/geoimc/\")\n",
    "\n",
    "from reco_utils.dataset import movielens\n",
    "from reco_utils.recommender.geoimc.geoimc_data import ML_100K\n",
    "from reco_utils.recommender.geoimc.geoimc_algorithm import IMCProblem\n",
    "from reco_utils.recommender.geoimc.geoimc_predict import Inferer\n",
    "from reco_utils.evaluation.python_evaluation import (\n",
    "    rmse, mae\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose the MovieLens dataset\n",
    "MOVIELENS_DATA_SIZE = '100k'\n",
    "# Normalize user, item features\n",
    "normalize = True\n",
    "# Rank (k) of the model\n",
    "rank = 300\n",
    "# Regularization parameter\n",
    "regularizer = 1e-3\n",
    "\n",
    "# Parameters for algorithm convergence\n",
    "max_iters = 150000\n",
    "max_time = 1000\n",
    "verbosity = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Download ML100K dataset and features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 4.81k/4.81k [00:09<00:00, 519KB/s]\n"
     ]
    }
   ],
   "source": [
    "# Create a directory to download ML100K\n",
    "dp = tempfile.mkdtemp(suffix='-geoimc')\n",
    "movielens.download_movielens(MOVIELENS_DATA_SIZE, f\"{dp}/ml-100k.zip\")\n",
    "with zipfile.ZipFile(f\"{dp}/ml-100k.zip\", 'r') as z:\n",
    "    z.extractall(dp)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load the dataset using the example features provided in helpers\n",
    "\n",
    "The features were generated using the same method as the work by Xin Dong et al. (2017)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = ML_100K(\n",
    "    normalize=normalize,\n",
    "    target_transform='binarize'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset.load_data(f\"{dp}/ml-100k/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Characteristics:\n",
      "\n",
      "              target: (943, 1682)\n",
      "              entities: (943, 1822), (1682, 1925)\n",
      "\n",
      "              training: (80000,)\n",
      "              training_entities: (943, 1822), (1682, 1925)\n",
      "\n",
      "              testing: (20000,)\n",
      "              test_entities: (943, 1822), (1682, 1925)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(f\"\"\"Characteristics:\n",
    "\n",
    "              target: {dataset.training_data.data.shape}\n",
    "              entities: {dataset.entities[0].shape}, {dataset.entities[1].shape}\n",
    "\n",
    "              training: {dataset.training_data.get_data().data.shape}\n",
    "              training_entities: {dataset.training_data.get_entity(\"row\").shape}, {dataset.training_data.get_entity(\"col\").shape}\n",
    "\n",
    "              testing: {dataset.test_data.get_data().data.shape}\n",
    "              test_entities: {dataset.test_data.get_entity(\"row\").shape}, {dataset.test_data.get_entity(\"col\").shape}\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Initialize the IMC problem"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(10)\n",
    "prblm = IMCProblem(\n",
    "    dataset.training_data,\n",
    "    lambda1=regularizer,\n",
    "    rank=rank\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Optimizing...\n",
      "Terminated - max time reached after 1753 iterations.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Solve the Optimization problem\n",
    "prblm.solve(\n",
    "    max_time,\n",
    "    max_iters,\n",
    "    verbosity\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize an inferer\n",
    "inferer = Inferer(\n",
    "    method='dot'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Predict using the parametrized matrices\n",
    "predictions = inferer.infer(\n",
    "    dataset.test_data,\n",
    "    prblm.W\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare the test, predicted dataframes\n",
    "user_ids = dataset.test_data.get_data().tocoo().row\n",
    "item_ids = dataset.test_data.get_data().tocoo().col\n",
    "test_df = pd.DataFrame(\n",
    "    data={\n",
    "        \"userID\": user_ids,\n",
    "        \"itemID\": item_ids,\n",
    "        \"rating\": dataset.test_data.get_data().data\n",
    "    }\n",
    ")\n",
    "predictions_df = pd.DataFrame(\n",
    "    data={\n",
    "        \"userID\": user_ids,\n",
    "        \"itemID\": item_ids,\n",
    "        \"prediction\": [predictions[uid, iid] for uid, iid in list(zip(user_ids, item_ids))]\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "RMSE: 0.496351244012414\n",
      "MAE: 0.47524594431584\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Calculate RMSE\n",
    "RMSE = rmse(\n",
    "    test_df,\n",
    "    predictions_df\n",
    ")\n",
    "# Calculate MAE\n",
    "MAE = mae(\n",
    "    test_df,\n",
    "    predictions_df\n",
    ")\n",
    "print(f\"\"\"\n",
    "RMSE: {RMSE}\n",
    "MAE: {MAE}\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sb.glue(\"rmse\", RMSE)\n",
    "sb.glue(\"mae\", MAE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "[1] Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, Bamdev Mishra. _[Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach](https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00257)_. Transaction of the Association for Computational Linguistics (TACL), Volume 7, p.107-120, 2019.\n",
    "\n",
    "[2] Xin Dong, Lei Yu, Zhonghuo Wu, Yuxia Sun, Lingfeng Yuan, Fangxi Zhang. [A Hybrid Collaborative Filtering Model withDeep Structure for Recommender Systems](https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14676/13916).\n",
    "Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), p.1309-1315, 2017."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python (reco)",
   "language": "python",
   "name": "reco_base"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}